Abstract

Passivity-based control for port-Hamiltonian systems provides an intuitive way of achieving stabilization by rendering a system
passive with respect to a desired storage function. However, in most instances the control law has to be calculated by solving a 
complex partial differential equation (PDE). This paper considers energy-balancing passivity-based control (EB-PBC), which 
is a form of PBC in which the closed-loop energy is equal to the difference between the stored and supplied energies. We propose 
a method to parameterize EB-PBC that preserves the system's PDE matching conditions, does not require the specification
of a global desired Hamiltonian, includes performance criteria, and is robust to extra non-linearities such as control input
saturation. The parameters of the control law are found using actor-critic reinforcement learning, enabling learning near- 
optimal control policies satisfying a desired closed-loop energy landscape. The advantages are that near-optimal controllers 
can be generated using standard energy shaping techniques and that the solutions learned can be interpreted in terms of 
energy shaping and damping injection, which makes it possible to numerically assess stability using passivity theory. From 
the reinforcement learning perspective, our proposal allows for the class of port-Hamiltonian systems to be incorporated in
the actor-critic framework, speeding up the learning thanks to the resulting parameterization of the policy. The method has 
been successfully applied to the pendulum swing-up problem in simulations and real-life experiments. 
1 Introduction

Passivity-based control (PBC) [16] is a methodology 
that achieves the control objective by rendering a system 
passive with respect to a desired storage function [17]. 
Different forms of PBC have been successfully applied 
to design robust controllers [23] for mechanical systems 
and electrical circuits [17,18]. A key feature of PBC is 
that it exploits structural properties of the system. In 
this paper, we are interested in the passivity-based con- 
trol of systems endowed with a special structure, called 
port-Hamiltonian (PH) systems. PH systems have been 
widely used in PBC applications [5,19]. Their geometric 
structure allows reformulating a PBC problem in terms 
of solving a set of partial differential equations (PDEs).
Much research in the literature concerns solving or simplifying
such generally complex PDEs [17].

The drive for passivity-based control of port-Hamiltonian
systems is grounded in the search for global stability,
thus strongly relying on models. Other control techniques
have been developed when no models are known
and performance is important. One such example is re- 
inforcement learning (RL) [21]. RL is a semi-supervised 
learning control method that can solve optimal (stochas- 
tic) control problems for nonlinear systems, without the 
need for a process model or for explicitly solving complex 
equations. In RL the controller receives an immediate 
numerical reward as a function of the process state and 
possibly control action. The goal is to find an optimal 
control policy that maximizes the cumulative long-term 
rewards, which corresponds to maximizing a value func- 
tion [21]. In this paper, we use actor-critic techniques 
[11], which are a class of RL methods in which a separate
actor and critic are learned. The critic approximates the 
value function and the actor the policy (control law). 
Actor-critic reinforcement learning is suitable for prob- 
lems with continuous state and action spaces. A general 
disadvantage of RL is that the progress of learning can 
be very slow and non-monotonic. However, by incorpo- 
rating (partial) model knowledge, learning can be sped 
up [9]. 

In this paper we address two important issues: First, 
we propose a learning control structure within the PH 
framework that retains important properties of the PBC.
To this end, first a parameterization of a particular type 
of PBC, called energy-balancing passivity-based control 
(EB-PBC), is proposed such that the PDE arising in EB-PBC
can be split into a non-assignable part satisfying a
matching condition following from the EB-PBC frame- 
work and an assignable part that can be parameterized. 
Then, by applying actor-critic reinforcement learning 
the parameterized part can be learned while automati- 
cally verifying the matching PDE. This can be seen as a 
paradigm shift from the traditional model-based control 
synthesis for PH systems: we do not seek to synthesize a 
controller in closed-form, but we aim instead to learn one 
with proper structural constraints. This brings a number 
of advantages: I) It allows the control goal to be specified in a
"local" fashion through a reward function, without having
to consider the entire global behavior of the system.
The simplest example to illustrate this idea is a reward
function that is 1 when the system is in a small
neighborhood of the desired goal and 0 everywhere
else [21]. The learning algorithm will eventually find a
global control policy. In the model-based PBC synthesis 
counterpart one needs to specify a desired global Hamil- 
tonian. II) Learning brings performance in addition to 
the intrinsic stability properties of PBC. The structure 
of RL is such that the rewards are maximized, and these 
can include performance criteria, such as minimal time, 
energy consumption, etc. III) Learning offers additional
robustness and adaptability since it tolerates model un- 
certainty in the PH framework. 

From a learning control point of view, we present a sys- 
tematic way of incorporating a priori knowledge into 
the RL problem. The approach proposed in this paper 
yields, after learning, a controller that can be interpreted 
in terms of energy shaping control strategies. The same 
interpretability is typically not found in the traditional 
RL solutions. 

Thus, this work combines the advantages of both afore- 
mentioned control techniques, PBC and RL, and mit- 
igates some of their respective disadvantages. Histori- 
cally, the trends in control synthesis have oscillated be- 
tween performance and stabilization. PBC of PH sys- 
tems is rooted in the stability of multi-domain nonlin- 
ear systems. By including learning we aim to address 
performance in the PH framework. In the experimental 
section of the paper, we show that our method is also 
robust to unmodeled nonlinearities, such as control input
saturation. Control input saturation in PBC for PH
systems has been addressed explicitly in the literature 
[1,3,6,8,13,14,20]. We show that our approach solves the 
problem of control input saturation on the learning side 
without the need of augmenting the model-based PBC. 

The work presented in this paper draws an interest- 
ing parallel with the application of iterative feedback 
tuning (IFT) [10] in the PH framework [7]. Both techniques
optimize the parameters of the controller online,
with the difference that in IFT the objective is
to minimize the error between the desired output and 
the measured output of the system, while our approach 
aims at maximizing a reward function, which can be very
general. The choice of RL is warranted by its semi- 
supervised paradigm, as opposed to other traditional 
fully-supervised learning techniques, such as artificial 
neural networks or fuzzy approximators where the con- 
trol specification (function approximation information) 
is input/output data instead of reward functions. Such
fully-supervised techniques can be used within RL as 
function approximators to represent the value functions 
and control policies. Genetic algorithms can also be con- 
sidered as an alternative to RL since they rely on fitness 
functions that are analogous to the reward functions of
RL. We aim to explore such classes of algorithms in our 
future work. 

The theoretical background on PH systems and actor- 
critic reinforcement learning is described in Section 2 
and Section 3, respectively. In Section 4, our proposal for 
a parameterization of input-saturated EB-PBC control, 
compatible with actor-critic reinforcement learning, is 
introduced. We then specialize this result to mechanical 
systems in Section 5. Section 6 provides simulation and 
experimental results for the problem of swinging up an 
input-saturated inverted pendulum and Section 7 con- 
cludes the paper. 



2 Port-Hamiltonian Systems 

Port-Hamiltonian (PH) systems are a natural way of 
representing a physical system in terms of its energy 
exchange with the environment through ports [18]. The 
general framework of PH systems was introduced in [15] 
and was formalized in [24,23]. In this paper, we consider 
the input-state-output representation of the PH system 
which is of the form¹

$$\begin{cases} \dot{x} = [J(x) - R(x)]\,\nabla_x H(x) + g(x)\,u \\ y = g^T(x)\,\nabla_x H(x) \end{cases} \qquad (1)$$

where $x \in \mathbb{R}^n$ is the state vector, $u \in \mathbb{R}^m$, $m < n$ is the control input, $J, R : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ with $J(x) = -J(x)^T$ and $R(x) = R(x)^T \ge 0$ are the interconnection and damping matrix, respectively, $H : \mathbb{R}^n \to \mathbb{R}$ the Hamiltonian which is the stored energy in the system, $u, y \in \mathbb{R}^m$ are conjugated variables whose product has the unit of power and $g : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ the input matrix assumed to be full rank. For the remainder of this paper,
we denote: 

$$F(x) := J(x) - R(x) \qquad (2)$$



¹ We use the notation $\nabla_x := \partial/\partial x$. Furthermore, all (gradient) vectors are column vectors.
This matrix satisfies $F(x) + F^T(x) = -2R(x) \le 0$. System (1) satisfies the power-balance equation [17]:

$$\dot{H}(x) = -(\nabla_x H(x))^T R(x)\,\nabla_x H(x) + u^T y \qquad (3)$$

Since $R(x) \ge 0$, we obtain:

$$\dot{H}(x) \le u^T y \qquad (4)$$

which is called the passivity inequality, if $H(x)$ is positive semi-definite, and cyclo-passivity inequality, if $H(x)$ is not positive semi-definite nor bounded from below [17].
Hence, systems satisfying (4) are called (cyclo-)passive systems. The goal is to obtain the target closed-loop system:

$$\dot{x} = [J(x) - R_d(x)]\,\nabla_x H_d(x) \qquad (5)$$

through energy shaping using EB-PBC [17] and damping injection² such that $H_d(x)$ is the desired closed-loop energy which has a minimum at the desired equilibrium $x^*$ and satisfies:

$$\dot{H}_d(x) = -(\nabla_x H_d(x))^T R_d(x)\,\nabla_x H_d(x) \qquad (6)$$

which implies (cyclo-)passivity according to (3)-(4) if the desired damping $R_d(x) \ge 0$. Hence, the control objective is achieved by rendering the closed-loop system passive with respect to the desired storage function $H_d(x)$.

2.1 Energy Shaping 

Define the added energy function: 

$$H_a(x) := H_d(x) - H(x) \qquad (7)$$

A state-feedback law $u_{es}(x)$ is said to satisfy the energy-balancing property if it satisfies:

$$\dot{H}_a(x) = -u_{es}^T(x)\,y \qquad (8)$$

If (8) holds, the desired energy $H_d(x)$ is the difference between the stored and supplied energy. Assuming $g(x) \in \mathbb{R}^{n \times m}$, $m < n$, $\operatorname{rank}\{g(x)\} = m$, the control law:

$$u_{es}(x) = g^{\dagger}(x) F(x)\,\nabla_x H_a(x) \qquad (9)$$

with $g^{\dagger}(x) = (g^T(x) g(x))^{-1} g^T(x)$ solves the EB-PBC problem with $H_a(x)$ a solution of the following set of PDEs [17]:

$$\begin{bmatrix} g^{\perp}(x) F^T(x) \\ g^T(x) \end{bmatrix} \nabla_x H_a(x) = 0 \qquad (10)$$

with $g^{\perp}(x) \in \mathbb{R}^{(n-m) \times n}$ the full rank left-annihilator of $g(x)$, i.e. $g^{\perp}(x) g(x) = 0$.



2.2 Damping Injection

Damping is injected by feeding back the (new) passive output $g^T(x)\,\nabla_x H_d(x)$,

$$u_{di}(x) = -K(x)\, g^T(x)\,\nabla_x H_d(x) \qquad (11)$$

with $K(x) \in \mathbb{R}^{m \times m}$, $K(x) = K^T(x) \ge 0$ such that:

$$R_d(x) = R(x) + g(x) K(x) g^T(x) \qquad (12)$$

Hence, the full control law consists of an energy shaping part and a damping injection part:

$$u(x) = u_{es}(x) + u_{di}(x) = g^{\dagger}(x) F(x)\,\nabla_x H_a(x) - K(x)\, g^T(x)\,\nabla_x H_d(x) \qquad (13)$$
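As an illustration of how (9), (11) and (13) fit together, the following is a minimal Python sketch that evaluates the control law for user-supplied model functions. The function name `eb_pbc_control` and the toy mass-spring example are assumptions made for illustration only; they do not come from the paper, and the sketch presumes that the supplied gradient of the added energy already satisfies the matching PDE (10).

```python
import numpy as np

def eb_pbc_control(x, F, g, K, grad_Ha, grad_Hd):
    """Evaluate u = g^+ F grad(H_a) - K g^T grad(H_d), cf. (9), (11), (13).

    F, g, K, grad_Ha, grad_Hd are callables of the state x (hypothetical
    interface); grad_Ha is assumed to satisfy the matching PDE (10).
    """
    gx = g(x)                                    # n x m input matrix
    g_pinv = np.linalg.solve(gx.T @ gx, gx.T)    # (g^T g)^{-1} g^T
    u_es = g_pinv @ F(x) @ grad_Ha(x)            # energy shaping, (9)
    u_di = -K(x) @ gx.T @ grad_Hd(x)             # damping injection, (11)
    return u_es + u_di

# Toy example (assumed): unit mass-spring, x = [q, p], H = p^2/2 + q^2/2,
# desired energy H_d = p^2/2 + kd/2 (q - q_star)^2, so only the potential
# energy is shaped and the momentum part of grad(H_a) is zero.
kd, q_star = 4.0, 0.5
F = lambda x: np.array([[0.0, 1.0], [-1.0, -0.1]])
g = lambda x: np.array([[0.0], [1.0]])
K = lambda x: np.array([[0.5]])
grad_Hd = lambda x: np.array([kd * (x[0] - q_star), x[1]])
grad_Ha = lambda x: np.array([kd * (x[0] - q_star) - x[0], 0.0])

u = eb_pbc_control(np.array([1.0, 2.0]), F, g, K, grad_Ha, grad_Hd)
```

In this toy case the energy-shaping term replaces the original spring force by the desired one, and the damping term adds a force proportional to the momentum.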

3 Actor-Critic Reinforcement Learning

In reinforcement learning, the system to be controlled 
(called 'environment' in the RL literature) is modeled 
as a Markov decision process (MDP). In a deterministic 
setting, this MDP is defined by the tuple $M(X, U, f, \rho)$, where $X$ is the state space, $U$ the action space and $f : X \times U \to X$ the state transition function that describes the process to be controlled and returns the state $x_{k+1}$ after applying action $u_k$ in state $x_k$. The vector $x_k$ is obtained by applying a zero-order hold discretization $x_k = x(kT_s)$ with $T_s$ the sampling time. The reward function is defined by $\rho : X \times U \to \mathbb{R}$ and returns a scalar reward $r_{k+1} = \rho(x_{k+1}, u_k)$ after each transition. The goal of RL is to find an optimal control policy $\pi : X \to U$ by maximizing an expected cumulative or total reward described as some function of the immediate expected rewards. In this paper, we consider a discounted sum of rewards. The value function $V^{\pi} : X \to \mathbb{R}$,

$$V^{\pi}(x) = \sum_{i=0}^{\infty} \gamma^i r^{\pi}_{k+i+1} = \sum_{i=0}^{\infty} \gamma^i \rho(x_{k+i+1}, \pi(x_{k+i})), \quad x = x_k \qquad (14)$$

approximates this discounted sum during learning while following policy $\pi$, where $\gamma \in [0, 1)$ is the discount factor.

When dealing with large and/or continuous state and action
spaces, it is necessary to approximate the value function
and policy. Actor-critic (AC) algorithms [4,11] learn
a separate actor (policy $\pi$) and critic (value function
$V^{\pi}$). The critic approximates and updates (improves)
the value function. Then, the actor's parameters are updated
in the direction of that improvement. The actor
and critic are usually defined by a differentiable parameterization such that gradient ascent can be used to update the parameters. This is beneficial when dealing with continuous action spaces [22]. In this paper, the temporal-difference based Standard Actor-Critic (S-AC) algorithm from [9] is used. Define the approximated policy $\hat{\pi} : \mathbb{R}^n \to \mathbb{R}^m$ and the approximated value function as $\hat{V} : \mathbb{R}^n \to \mathbb{R}$. Denote the parameterization of the actor by $\vartheta \in \mathbb{R}^p$ and of the critic by $\theta \in \mathbb{R}^q$. The temporal difference [21]:

$$\delta_{k+1} := r_{k+1} + \gamma \hat{V}(x_{k+1}, \theta_k) - \hat{V}(x_k, \theta_k) \qquad (15)$$

is used to update the critic parameters using the following gradient ascent update rule:

$$\theta_{k+1} = \theta_k + \alpha_c\, \delta_{k+1} \nabla_{\theta} \hat{V}(x_k, \theta_k) \qquad (16)$$

in which $\alpha_c > 0$ is the learning rate. Eligibility traces $e_k \in \mathbb{R}^q$ [21] can be used to speed up learning by including reward information about previously visited states. The update for the critic parameters becomes:

$$e_{k+1} = \gamma \lambda e_k + \nabla_{\theta} \hat{V}(x_k, \theta_k) \qquad (17)$$
$$\theta_{k+1} = \theta_k + \alpha_c\, \delta_{k+1} e_{k+1} \qquad (18)$$


with $\lambda \in [0, 1)$ the trace-decay rate. The policy approximation can be updated in a similar fashion, as described below. RL needs exploration in order to visit new, unseen parts of the state-action space so as to possibly find better policies. This is achieved by perturbing the policy with an exploration term $\Delta u_k$. Many techniques have been developed for choosing the type of exploration term (see e.g. [2]). In this paper we consider $\Delta u_k$ to be random with zero mean. In the experimental section we choose $\Delta u_k$ to have a normal distribution. The overall control action now becomes:

$$u_k = \hat{\pi}(x_k, \vartheta_k) + \Delta u_k \qquad (19)$$

The policy update is such that the policy parameters are updated towards (away from) $\Delta u_k$ if the temporal difference (15) is positive (negative). This leads to the following policy update rule:

$$\vartheta_{k+1} = \vartheta_k + \alpha_a\, \delta_{k+1} \Delta u_k \nabla_{\vartheta} \hat{\pi}(x_k, \vartheta_k) \qquad (20)$$

with $\alpha_a > 0$ the actor learning rate.
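The update rules (15), (17), (18) and (20) can be sketched in a few lines of Python. The sketch assumes linear-in-parameters approximators, $\hat{V}(x, \theta) = \theta^T \phi_c(x)$ and $\hat{\pi}(x, \vartheta) = \vartheta^T \phi_a(x)$, and a scalar action, so that the gradients reduce to the feature vectors. The function name `sac_step` and its argument names are hypothetical.

```python
import numpy as np

def sac_step(theta, vartheta, e, x, x_next, r_next, du,
             phi_c, phi_a, gamma, lam, alpha_c, alpha_a):
    """One S-AC update for linear approximators and a scalar action.

    phi_c, phi_a: callables returning critic/actor feature vectors
    (assumed); du is the exploration term applied at state x.
    """
    # Temporal difference (15)
    delta = r_next + gamma * theta @ phi_c(x_next) - theta @ phi_c(x)
    # Eligibility trace (17); grad_theta V_hat = phi_c(x) for a linear critic
    e = gamma * lam * e + phi_c(x)
    # Critic update (18)
    theta = theta + alpha_c * delta * e
    # Actor update (20); grad_vartheta pi_hat = phi_a(x) for a linear actor
    vartheta = vartheta + alpha_a * delta * du * phi_a(x)
    return theta, vartheta, e, delta
```

A positive temporal difference moves the policy output at `x` in the direction of the exploration term that was tried, and a negative one moves it away, which is the behavior described for (20).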

4 Energy-Balancing Actor-Critic 

In this section we present our main results. Our ap- 
proach is that we will use the PDE (10) and split it into 
an assignable, parameterizable part and an unassignable 
part that satisfies the matching condition. In this way, 
it is possible to parameterize the desired closed-loop Hamiltonian $H_d(x)$ and simultaneously satisfy (10). After that, we parameterize the damping matrix $K(x)$. The two parameterized variables, the desired closed-loop energy $H_d(x)$ and damping $K(x)$, are then suitable for actor-critic RL by defining two actors for these variables. First, we reformulate the PDE (10) in terms of the desired closed-loop energy $H_d(x)$ by applying (7):

$$\underbrace{\begin{bmatrix} g^{\perp}(x) F^T(x) \\ g^T(x) \end{bmatrix}}_{A(x)} \left( \nabla_x H_d(x) - \nabla_x H(x) \right) = 0 \qquad (21)$$

and we denote the kernel of $A(x)$ as:

$$\ker(A(x)) = \{ N(x) \in \mathbb{R}^{n \times b} : A(x) N(x) = 0 \} \qquad (22)$$

such that (21) reduces to:

$$\nabla_x H_d(x) - \nabla_x H(x) = N(x)\,a \qquad (23)$$

with $a \in \mathbb{R}^b$. Suppose that (an example is given further on) the state vector $x$ can be split, such that $x = [w^T\ z^T]^T$, where $z \in \mathbb{R}^c$ and $w \in \mathbb{R}^d$, $c + d = n$, corresponding to the zero and non-zero elements of $N(x)$ such that:

$$\begin{bmatrix} \nabla_w H_d(x) \\ \nabla_z H_d(x) \end{bmatrix} - \begin{bmatrix} \nabla_w H(x) \\ \nabla_z H(x) \end{bmatrix} = \begin{bmatrix} N_w(x) \\ 0 \end{bmatrix} a \qquad (24)$$

We assume that the matrix $N_w(x)$ is rank $d$, which is always true for fully actuated mechanical systems (see Section 5). It is clear that $\nabla_z H_d(x) = \nabla_z H(x)$, which we call the matching condition, and hence $\nabla_z H_d(x)$ cannot be chosen freely. Thus, only the desired closed-loop energy gradient vector $\nabla_w H_d(x)$ is free for assignment. We consider a $\xi$-parameterized total desired energy with the following form:

$$\hat{H}_d(x, \xi) := H(x) + \xi^T \phi_H(w) + \bar{H}_d(w) + C \qquad (25)$$

where $\xi^T \phi_H(w)$ represents a linear-in-parameters basis function approximator ($\xi \in \mathbb{R}^e$ a parameter vector and $\phi_H(w) \in \mathbb{R}^e$ the basis function, with $e$ chosen sufficiently large to represent the assignable desired closed-loop energy), $\bar{H}_d(w)$ is an arbitrary function of $w$ and $C$ is chosen to render $\hat{H}_d(x, \xi)$ non-negative. The function $\hat{H}_d(x, \xi)$ automatically verifies (24). The elements of the desired damping matrix $K(x)$ of (13), denoted $\hat{K}(x, \Psi)$, can be parameterized in a similar way:



$$[\hat{K}(x, \Psi)]_{ij} = \sum_{l=1}^{f} [\Psi]_{ijl}\, [\phi_K(x)]_l \qquad (26)$$

with $\Psi \in \mathbb{R}^{m \times m \times f}$ and

$$[\Psi]_{ijl} = [\Psi]_{jil} \qquad (27)$$

a parameter vector such that $\hat{K}(x, \Psi) = \hat{K}^T(x, \Psi)$, $(i, j) = 1, \ldots, m$ and $\phi_K(x) \in \mathbb{R}^f$ basis functions. We purposefully do not impose $\hat{K}(x, \Psi) \ge 0$ to allow the injection of energy in the system via the damping term. Although this breaches the passivity criterion of (4), we shall see that local stability can still be numerically demonstrated using passivity analysis in Section 6.3. This choice is made based on the knowledge that the standard Energy Balancing PBC method (without any extra machinery to accommodate saturation) cannot stabilize in the up position a saturated-input pendulum system starting from the down position. As such, this choice illustrates the power of RL in finding alternative routes to obtain control policies, such as injecting energy through the damping term. In other settings enforcing that $\hat{K}(x, \Psi) \ge 0$ benefits the stability analysis. The control law (13) now becomes (when no ambiguity is present, the function arguments are dropped to improve readability):

$$u(x, \xi, \Psi) = g^{\dagger}(x) F(x) \begin{bmatrix} \xi^T \nabla_w \phi_H(w) + \nabla_w \bar{H}_d(w) \\ 0 \end{bmatrix} - \hat{K}(x, \Psi)\, g^T(x)\, \nabla_x \hat{H}_d(x, \xi) \qquad (28)$$

Now, we are ready to introduce the update equations for the parameter vectors $\xi$, $[\Psi]_{ij}$. Denote by $\xi_k$, $[\Psi_k]_{ij}$ the value of the parameters at the discrete time step $k$. The policy $\hat{\pi}$ of the actor-critic reinforcement learning algorithm is chosen equal to the control law parameterized by (28):

$$\hat{\pi}(x_k, \xi_k, \Psi_k) := u(x_k, \xi_k, \Psi_k) \qquad (29)$$

In this paper we take the control input saturation problem into account by considering a generic saturation function $\varsigma : \mathbb{R}^m \to S$, $S \subset \mathbb{R}^m$, such that:

$$\varsigma(u(x)) \in S \quad \forall u \qquad (30)$$

where $S$ is the set of valid control inputs. The control action with exploration (19) becomes:

$$u_k = \varsigma\left( \hat{\pi}(x_k, \xi_k, \Psi_k) + \Delta u_k \right) \qquad (31)$$

where $\Delta u_k$ is drawn from a desired distribution. The exploration term to be used in the actor update (20) must be adjusted to respect the saturation:

$$\Delta \bar{u}_k = u_k - \hat{\pi}(x_k, \xi_k, \Psi_k) \qquad (32)$$

Furthermore, we obtain the following gradients of the saturated policy:

$$\nabla_{\xi}\, \varsigma(\hat{\pi}) = \nabla_{\hat{\pi}}\, \varsigma(\hat{\pi})\, \nabla_{\xi} \hat{\pi} \qquad (33)$$
$$\nabla_{[\Psi]_{ij}}\, \varsigma(\hat{\pi}) = \nabla_{\hat{\pi}}\, \varsigma(\hat{\pi})\, \nabla_{[\Psi]_{ij}} \hat{\pi} \qquad (34)$$

Although not explicitly indicated in the previous equations, the (lack of) differentiability of the saturation function has to be considered for the problem at hand such that the computation of the gradient of $\varsigma$ can be made. For a traditional saturation in $u_i \in \mathbb{R}$ of the form $\max(u_{\min}, \min(u_{\max}, u_i))$, i.e. assuming each input $u_i$ is bounded by $u_{\min}$ and $u_{\max}$, the gradient of $\varsigma$ is the zero matrix outside the unsaturated set $S$ (i.e. when $u_i < u_{\min}$ or $u_i > u_{\max}$). For other types of saturation the function $\nabla \varsigma$ must be computed. Finally, the actor parameters $\xi_k$, $[\Psi_k]_{ij}$ are updated according to (20), respecting the saturated policy gradients. For the parameters of the desired Hamiltonian we obtain:

$$\xi_{k+1} = \xi_k + \alpha_{a,\xi}\, \delta_{k+1}\, \Delta \bar{u}_k\, \nabla_{\xi}\, \varsigma\left( \hat{\pi}(x_k, \xi_k, \Psi_k) \right) \qquad (35)$$

and for the parameters of the desired damping we have:

$$[\Psi_{k+1}]_{ij} = [\Psi_k]_{ij} + \alpha_{a,[\Psi]_{ij}}\, \delta_{k+1}\, \Delta \bar{u}_k\, \nabla_{[\Psi_k]_{ij}}\, \varsigma\left( \hat{\pi}(x_k, \xi_k, \Psi_k) \right) \qquad (36)$$

where $(i, j) = 1, \ldots, m$, while observing (27). Algorithm 1 gives the entire Energy-Balancing Actor-Critic algorithm with input saturation.
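The saturation-related quantities in (31), (32) and the gradient of the saturation used in (33)-(34) can be sketched as follows for the box saturation $\max(u_{\min}, \min(u_{\max}, u_i))$. The function names are hypothetical, and treating the gradient as 1 exactly at the bounds is an assumption of this sketch, since the clip is not differentiable there.

```python
import numpy as np

def saturate(u, u_min, u_max):
    """Box saturation max(u_min, min(u_max, u_i)) applied element-wise."""
    return np.clip(u, u_min, u_max)

def saturate_grad(u, u_min, u_max):
    """Jacobian of the box saturation: diagonal, 1 inside the bounds and
    0 where an input is saturated (value at the bounds set to 1 here)."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    inside = (u >= u_min) & (u <= u_max)
    return np.diag(inside.astype(float))

def saturated_action(pi_value, du, u_min, u_max):
    """Applied action (31) and adjusted exploration term (32)."""
    u = saturate(pi_value + du, u_min, u_max)
    du_bar = u - pi_value
    return u, du_bar
```

When the perturbed action is clipped, `du_bar` is smaller in magnitude than the drawn exploration term, so the actor update only credits the part of the exploration that was actually applied to the system.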

The dynamics of the Energy-Balancing Actor-Critic Algorithm 1 raise a number of questions regarding stability and convergence: are the good stability properties of the traditional energy-balancing PBC lost? In effect this is not the case, as if the parameter $\xi$ is fixed then stability is preserved, in the sense that the system is passive with respect to the storage function $\hat{H}_d(x, \xi)$. The question is then whether during learning (while the parameter $\xi$ is evolving) the Hamiltonian $\hat{H}_d(x, \xi)$ will capture the desired control specification. One cannot assume that the desired Hamiltonian will immediately fulfill the control specification, since if that were the case then no learning would be needed. In the RL community it is generally accepted that during learning no stability and convergence guarantees can be given [21], as exploration is a necessary component of the framework. In our framework, we cannot guarantee convergence during learning, but by constraining the desired Hamiltonian we can prevent the total energy from growing unbounded, avoiding possible instabilities. Another relevant question is: will RL converge
in this setting? The constrained Hamiltonian results in 
sub-optimal policies from the RL side as the representation of the policy is not entirely free. In practice, as presented next in Section 6, we observe that not only does the RL component converge very fast (faster than traditional model-free RL) but throughout learning the system never becomes unstable. Additionally, the resulting policy performs as well as standard model-free approximated RL algorithms. The last point to consider is that since the Energy Balancing PBC suffers from the dissipation obstacle [17], limiting its applicability to special classes of systems such as mechanical systems, the algorithm we present contains the same limitation. Eliminating this limitation is open research for future work.

Algorithm 1 Energy-Balancing Actor-Critic

Input: System (1), $\lambda$, $\gamma$, $\alpha_a$ for each actor, $\alpha_c$.
 1: $e_0(x) = 0 \ \forall x$
 2: Initialize $x_0$, $\theta_0$, $\xi_0$, $\Psi_0$
 3: $k \leftarrow 1$
 4: loop
 5:   Execute:
 6:     Draw $\Delta u_k \sim \mathcal{N}(0, \sigma^2)$, calculate action $u_k = \varsigma(\hat{\pi}(x_k, \xi_k, \Psi_k) + \Delta u_k)$, $\Delta \bar{u}_k = u_k - \hat{\pi}(x_k, \xi_k, \Psi_k)$
 7:     Observe next state $x_{k+1}$ and calculate reward $r_{k+1} = \rho(x_{k+1}, u_k)$
 8:   Critic:
 9:     Temporal difference: $\delta_{k+1} = r_{k+1} + \gamma \hat{V}(x_{k+1}, \theta_k) - \hat{V}(x_k, \theta_k)$
10:     Eligibility trace: $e_{k+1} = \gamma \lambda e_k + \nabla_{\theta} \hat{V}(x_k, \theta_k)$
11:     Critic update: $\theta_{k+1} = \theta_k + \alpha_c\, \delta_{k+1} e_{k+1}$
12:   Actors:
13:     Actor 1 ($\hat{H}_d(x, \xi)$): $\xi_{k+1} = \xi_k + \alpha_{a,\xi}\, \delta_{k+1}\, \Delta \bar{u}_k\, \nabla_{\xi}\, \varsigma(\hat{\pi}(x_k, \xi_k, \Psi_k))$
14:     Actor 2 ($\hat{K}(x, \Psi)$):
15:     for $i, j = 1, \ldots, m$ do
16:       $[\Psi_{k+1}]_{ij} = [\Psi_k]_{ij} + \alpha_{a,[\Psi]_{ij}}\, \delta_{k+1}\, \Delta \bar{u}_k\, \nabla_{[\Psi_k]_{ij}}\, \varsigma(\hat{\pi}(x_k, \xi_k, \Psi_k))$
17:     end for
18: end loop



5 Mechanical Systems 

To illustrate an application of the method, consider a 
fully actuated mechanical system of the form:

$$\begin{bmatrix} \dot{q} \\ \dot{p} \end{bmatrix} = \begin{bmatrix} 0 & I \\ -I & -\bar{R} \end{bmatrix} \begin{bmatrix} \nabla_q H(q, p) \\ \nabla_p H(q, p) \end{bmatrix} + \begin{bmatrix} 0 \\ G \end{bmatrix} u \qquad (37)$$

with $q \in \mathbb{R}^{\bar{n}}$, $p \in \mathbb{R}^{\bar{n}}$ ($\bar{n} = n/2$, $n$ even) the generalized positions and momenta, respectively, $G = I$ and $\bar{R} \in \mathbb{R}^{\bar{n} \times \bar{n}}$ the damping matrix. The system admits (1) with $\bar{R} \ge 0$ and the Hamiltonian:

$$H(q, p) = \tfrac{1}{2} p^T M^{-1}(q)\, p + P(q) \qquad (38)$$

with $M(q) = M^T(q) > 0$ the inertia matrix and $P(q)$ the potential energy. For the system (37) it holds that $\operatorname{rank}\{g(x)\} = \bar{n}$ and the state vector can be split into part $w = [q_1, q_2, \ldots, q_{\bar{n}}]^T$ and part $z = [p_1, p_2, \ldots, p_{\bar{n}}]^T$. Since $g(x) = [0\ I]^T$, its annihilator can be written as $g^{\perp}(x) = [\tilde{g}(x)\ 0]$, for an arbitrary matrix $\tilde{g}(x)$. This means that only the potential energy can be shaped, which is widely known in EB-PBC for mechanical systems. The approximated desired closed-loop energy (25) reads:

$$\hat{H}_d(x, \xi) = \tfrac{1}{2} p^T M^{-1}(q)\, p + \xi^T \phi_H(q) \qquad (39)$$

where the first term represents the unassignable part, i.e. the kinetic energy of the system Hamiltonian (38), and the second term $\xi^T \phi_H(q)$ the assignable desired potential energy. The actor updates can be defined for each parameter according to (35)-(36). For underactuated mechanical systems, e.g. $G = [0\ I]^T$, the split state vector $z$ is enlarged with those $q$-coordinates that cannot be actuated directly because these coordinates correspond to the zero elements of $N(x)$ (i.e. the matrix $N_q(x)$ is no longer rank $\bar{n}$).



6 Example: Pendulum Swing-up 

To validate our method, the problem of swinging up an 
inverted pendulum subject to control saturation is stud- 
ied in simulation and using the actual physical setup de- 
picted in Fig. 1. 

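Fig. 1. Inverted pendulum setup.

The pendulum swing-up is a low-dimensional, but highly nonlinear control problem commonly used as a benchmark in the RL literature [9] and it has also been studied in PBC [3]. The equations of motion admit (37) and read:

$$\begin{bmatrix} \dot{q} \\ \dot{p} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -1 & -R(\dot{q}) \end{bmatrix} \begin{bmatrix} \nabla_q H(q, p) \\ \nabla_p H(q, p) \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{K_p}{R_p} \end{bmatrix} u, \qquad y = \begin{bmatrix} 0 & \frac{K_p}{R_p} \end{bmatrix} \begin{bmatrix} \nabla_q H(q, p) \\ \nabla_p H(q, p) \end{bmatrix} \qquad (40)$$

with $q$ the angle of the pendulum and $p$ the angular momentum, thus we denote the full measurable state $x = [q, p]^T$. The damping term is:

$$R(\dot{q}) = b_p + \frac{K_p^2}{R_p} + \frac{\sigma_p}{|\dot{q}|} \qquad (41)$$

for which it holds that $R(\dot{q}) > 0$, $\forall \dot{q}$. Furthermore, we denote the Hamiltonian:

$$H(q, p) = \frac{p^2}{2 J_p} + P(q) \qquad (42)$$

with:

$$P(q) = M_p g_p l_p (1 + \cos q) \qquad (43)$$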
The model parameters are given in Table 1.

Table 1
Inverted pendulum model parameters

Model parameter     Symbol       Value              Units
Pendulum inertia    $J_p$        1.90 · 10^{…}      kg m²
Pendulum mass       $M_p$        5.2 · 10^{…}       kg
Gravity             $g_p$        9.81               m/s²
Pendulum length     $l_p$        4.20 · 10^{…}      m
Dynamic friction    $b_p$        2.48 · 10^{-…}     N m s
Static friction     $\sigma_p$   1.0 · 10^{…}       N
Torque constant     $K_p$        5.60 · 10^{-…}     N m/A
Rotor resistance    $R_p$        9.92               Ω

The desired Hamiltonian (25) reads:

$$\hat{H}_d(x, \xi) = \frac{p^2}{2 J_p} + \xi^T \phi_H(q) \qquad (44)$$



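Only the potential energy can be shaped, which we denote by $\hat{P}_d(q, \xi) = \xi^T \phi_H(q)$. Furthermore, as there is only one input, $\hat{K}(x, \psi)$ becomes a scalar:

$$\hat{K}(x, \psi) = \psi^T \phi_K(x) \qquad (45)$$

Thus, control law (28) results in:

$$u(x, \xi, \psi) = -\frac{R_p}{K_p} \left( \xi^T \nabla_q \phi_H + M_p g_p l_p \sin(q) \right) \qquad (46)$$
$$\qquad\qquad - \frac{K_p}{R_p}\, \psi^T \phi_K\, \frac{p}{J_p} \qquad (47)$$

which we define as the policy $\hat{\pi}(x, \xi, \psi)$. Hence, we have two actor updates:

$$\xi_{k+1} = \xi_k + \alpha_{a,\xi}\, \delta_{k+1}\, \Delta \bar{u}_k\, \nabla_{\xi}\, \varsigma\left( \hat{\pi}(x_k, \xi_k, \psi_k) \right) \qquad (48)$$
$$\psi_{k+1} = \psi_k + \alpha_{a,\psi}\, \delta_{k+1}\, \Delta \bar{u}_k\, \nabla_{\psi}\, \varsigma\left( \hat{\pi}(x_k, \xi_k, \psi_k) \right) \qquad (49)$$

for the desired potential energy $\hat{P}_d(q, \xi)$ and desired damping $\hat{K}(x, \psi)$, respectively.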
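A possible Python sketch of the pendulum policy is given below. It is not taken from the paper: it assumes the input matrix $g = [0\ \ K_p/R_p]^T$, cosine (Fourier) basis functions on states scaled to $[-1, 1]$ as introduced in Section 6.1, and a symmetric momentum bound `p_max`. The function name, the parameter dictionary `par` and the numerical values in the usage lines are all hypothetical.

```python
import numpy as np
from itertools import product

def pendulum_policy(q, p, xi, psi, par, p_max):
    """Energy-shaping plus damping-injection policy for the pendulum.

    par: dict with keys Jp, Mp, gp, lp, Kp, Rp (hypothetical container).
    p_max: bound used to scale p onto [-1, 1] (assumed symmetric).
    """
    # Actor 1: 1-D cosine basis in q on [-pi, pi]; with q_bar = q / pi,
    # phi_H,i(q) = cos(pi * c_i * q_bar) = cos(c_i * q).
    c_h = np.arange(len(xi))
    dphi_h = -c_h * np.sin(c_h * q)            # d/dq of cos(c_i * q)
    # Actor 2: 2-D cosine basis on the scaled state [q / pi, p / p_max].
    order = int(round(np.sqrt(len(psi)))) - 1
    c_k = np.array(list(product(range(order + 1), repeat=2)))
    phi_k = np.cos(np.pi * c_k @ np.array([q / np.pi, p / p_max]))
    k_hat = psi @ phi_k                        # scalar damping gain
    # Energy shaping: -(Rp/Kp) (xi^T dphi_H/dq + Mp gp lp sin q)
    u_es = -(par["Rp"] / par["Kp"]) * (
        xi @ dphi_h + par["Mp"] * par["gp"] * par["lp"] * np.sin(q))
    # Damping injection: -(Kp/Rp) K_hat p / Jp
    u_di = -(par["Kp"] / par["Rp"]) * k_hat * p / par["Jp"]
    return u_es + u_di

# Usage with made-up unit parameters (not the values of Table 1)
par = dict(Jp=1.0, Mp=1.0, gp=1.0, lp=1.0, Kp=1.0, Rp=2.0)
u = pendulum_policy(np.pi / 2, 0.3, np.zeros(4), np.zeros(16), par, p_max=1.0)
```

With all parameters at zero, as at the start of learning, the policy reduces to cancelling the gravity term of the open-loop potential energy.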

6.1 Function Approximation

To approximate the critic and the two actors, function 
approximators are necessary. In this paper we use the 
Fourier basis [12] because of its ease of use, the possibil- 
ity to incorporate information about the symmetry in 
the system and the ability to ascertain properties useful 
for stability analysis of this specific problem. The pe- 
riodicity of the function approximators obtained via a 
Fourier basis is compatible with the topology of the configuration space of the pendulum, defined to be $S^1 \times \mathbb{R}$. We define a multivariate $N$th-order³ Fourier basis for $n$ dimensions as:

$$\phi_i(\bar{x}) = \cos(\pi c_i^T \bar{x}), \quad i = 1, \ldots, (N+1)^n \qquad (50)$$

with $c_i \in \mathbb{Z}^n$, which means that all possible $N + 1$ integer values, or frequencies, are combined in a vector in $\mathbb{Z}^n$ to create a matrix $c \in \mathbb{Z}^{n \times (N+1)^n}$ containing all possible frequency combinations. For example,

$$c_1 = [0\ 0]^T, \quad c_2 = [1\ 0]^T, \quad \ldots, \quad c_{(3+1)^2} = [3\ 3]^T \qquad (51)$$

³ 'Order' refers to the order of approximation; 'dimensions' to the number of states in the system.
for a 3rd-order Fourier basis in 2 dimensions. The state $x$ is scaled according to:

$$\bar{x}_i = \frac{x_i - x_{i,\min}}{x_{i,\max} - x_{i,\min}} \left( \bar{x}_{i,\max} - \bar{x}_{i,\min} \right) + \bar{x}_{i,\min} \qquad (52)$$

for $i = 1, \ldots, n$ with $(\bar{x}_{i,\min}, \bar{x}_{i,\max}) = (-1, 1)$. Projecting the state variables onto this symmetrical range has several advantages. First, this means that the policy will be periodic with period $T = 2$, such that it wraps around (i.e., modulo $2\pi$) and prevents discontinuities at the boundary values of the angle ($x = [\pi \pm \epsilon, p]$, $\epsilon$ very small). Second, learning will be faster because updating the value function and policy for some $x$ also applies to the sign-opposite value of $x$. Third, $\nabla_q \hat{P}_d(0, \xi) = 0$ by the choice of parameterization, which is beneficial for stability analysis. Although the momentum will now also be periodic in the value function and policy, this is not a problem because the value function and policy approximation are restricted to a domain and the momentum itself is also restricted to the same domain using saturation. We adopt the adjusted learning rate from [12] such that:

$$\alpha_{a_i,\xi} = \frac{\alpha_{a_b,\xi}}{\| c_i \|_2}, \qquad \alpha_{a_i,\psi} = \frac{\alpha_{a_b,\psi}}{\| c_i \|_2} \qquad (53)$$

for $i = 1, \ldots, (N+1)^n$ with $\alpha_{a_b,\xi}$, $\alpha_{a_b,\psi}$ the base learning rate for the two actors (Table 2) and $\alpha_{a_1,\xi} = \alpha_{a_b,\xi}$, $\alpha_{a_1,\psi} = \alpha_{a_b,\psi}$ to avoid division by zero for $c_1 = [0\ 0]^T$. Equation (53) implies that parameters corresponding to basis functions with higher (lower) frequencies are learned slower (faster). The parameterizations described above result in $\nabla_x \hat{H}_d(x^*, \xi) = 0$ for all $\xi$, where $x^* = [0, 0]^T$ is the goal state. This entails that the goal state is a critical point of the Hamiltonian throughout the learning process, in effect speeding up the RL convergence.

6.2 Simulation

The task is to learn to swing up and stabilize the pendulum from the initial position pointing down $x_0 = [\pi, 0]^T$ to the desired equilibrium position at the top $x^* = [0, 0]^T$. Since the control action is saturated, the system is not able to swing up the pendulum directly, but rather it must swing back and forth to build up momentum to eventually reach the equilibrium. The reward function $\rho$ is defined such that it has its maximum in the desired unstable equilibrium and penalizes other states via:

$$\rho(x, u) = Q_r (\cos(q) - 1) - R_r p^2 \qquad (54)$$

with:

$$Q_r = 25, \qquad R_r = \frac{0.1}{J_p^2} \qquad (55)$$

This reward function is consistent with the mapping $S^1 \to \mathbb{R}$ for the angle and proved to improve performance over a purely quadratic reward, such as the one used in e.g. [9]. For the critic, we define the basis function approximation as:

$$\hat{V}(x, \theta) = \theta^T \phi_c(\bar{x}) \qquad (56)$$

with $\phi_c(\bar{x})$ a 3rd-order Fourier basis resulting in 16 learnable parameters $\theta$ in the domain $[q_{\min}, q_{\max}] \times [p_{\min}, p_{\max}] = [-\pi, \pi] \times [-8\pi J_p, 8\pi J_p]$. Actor 1 ($\hat{P}_d(q, \xi)$) is parameterized using a 3rd-order Fourier basis in the range $[-\pi, \pi]$ resulting in 4 learnable parameters. Actor 2 ($\hat{K}(x, \psi)$) is also parameterized using a 3rd-order Fourier basis for the full state space, in the same domain as the critic. Exploration is done at every time step by randomly perturbing the action with a normally distributed zero-mean white noise with standard deviation $\sigma = 1$, i.e.:

$$\Delta u \sim \mathcal{N}(0, 1) \qquad (57)$$

We incorporate saturation by defining the saturation function (30) as:

$$\varsigma(u_k) = \begin{cases} u_k & \text{if } |u_k| \le u_{\max} \\ \operatorname{sgn}(u_k)\, u_{\max} & \text{otherwise} \end{cases}$$
Recall that the saturation must be taken into account in the policy gradients by applying (33)-(34). The parameters were all initialized with zero vectors of appropriate dimensions, i.e. $(\theta_0, \xi_0, \psi_0) = 0$. The algorithm was first run with the system simulated in Matlab for 200 trials of three seconds each (with a near-optimal policy, the pendulum needs approximately one second to swing up). Each trial begins in the initial position $x_0$. This simulation was repeated 50 times to get an estimate of the average, minimum, maximum and confidence regions for the learning curve. The simulation parameters are given in Table 2.
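The function approximation used in this simulation (the Fourier features of Section 6.1, the scaling onto $[-1, 1]$ and the frequency-dependent learning rates) can be sketched in Python as follows. This is an illustrative sketch, not the Matlab code used by the authors: all function names are hypothetical, the affine scaling is the standard map onto $[-1, 1]$, and `itertools.product` enumerates the frequency vectors in a different order than the example (51), which does not affect the approximator.

```python
import numpy as np
from itertools import product

def fourier_frequencies(order, dims):
    """All (order+1)**dims integer frequency vectors c_i, one per row."""
    return np.array(list(product(range(order + 1), repeat=dims)))

def scale_state(x, x_min, x_max):
    """Affine map of each state component onto [-1, 1]."""
    x, x_min, x_max = (np.asarray(v, dtype=float) for v in (x, x_min, x_max))
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def fourier_features(x_bar, c):
    """phi_i(x_bar) = cos(pi * c_i^T x_bar) for every row c_i of c."""
    return np.cos(np.pi * c @ x_bar)

def adjusted_learning_rates(alpha_base, c):
    """alpha_i = alpha_base / ||c_i||_2, with the base rate kept for c_i = 0."""
    norms = np.linalg.norm(c, axis=1)
    norms[norms == 0.0] = 1.0      # avoid division by zero for c_1 = [0 0]^T
    return alpha_base / norms

# 3rd-order basis in 2 dimensions: 16 features, as used for the critic
c = fourier_frequencies(3, 2)
x_bar = scale_state([np.pi, 0.0], [-np.pi, -1.0], [np.pi, 1.0])
phi = fourier_features(x_bar, c)
rates = adjusted_learning_rates(0.2, c)
```

Because every feature is a cosine of an integer multiple of $\pi \bar{x}$, the features take the same values at $\bar{q} = -1$ and $\bar{q} = 1$, which is what makes the approximators wrap around in the angle.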

Table 2
Simulation parameters

Simulation parameter                     Symbol                 Value           Units
Number of trials                         -                      200             -
Trial duration                           $T_t$                  3               s
Sample time                              $T_s$                  0.03            s
Decay rate                               $\gamma$               0.97            -
Eligibility trace decay                  $\lambda$              0.65            -
Exploration variance                     $\sigma^2$             1               -
Max control input                        $u_{\max}$             3               V
Learning rate of critic                  $\alpha_c$             0.05            -
Learning rate of $\hat{P}_d(q, \xi)$     $\alpha_{a_b,\xi}$     1 × 10^{…}      -
Learning rate of $\hat{K}(x, \psi)$      $\alpha_{a_b,\psi}$    0.2             -
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-1000 



Fig. 2. Results for the EBAC method for 50 learning simulations,
plotted against time [min]: mean, 95% confidence region for
the mean, and max and min bounds.
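The time axis of the learning curve relates to the trial count through the trial duration and sample time in Table 2. As a short worked check (plain arithmetic on the tabulated values, not an additional result):

```latex
\begin{align*}
  \text{samples per trial} &= \frac{T_t}{T_s} = \frac{3\ \text{s}}{0.03\ \text{s}} = 100, \\
  \text{time per run}      &= 200 \cdot T_t = 600\ \text{s} = 10\ \text{min}, \\
  \text{time for 40 trials} &= 40 \cdot T_t = 120\ \text{s} = 2\ \text{min}.
\end{align*}
```

One minute on the time axis therefore corresponds to 20 trials.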



Fig. 2 shows the average learning curve obtained after
50 simulations. The algorithm shows good convergence
and on average needs about 2 minutes (40 trials) to reach
a near-optimal policy. The initial drop in performance
is caused by the zero-initialization of the value function
(critic), which is too optimistic compared to the true
value function. Therefore, the controller explores a large
part of the state space and receives a lot of negative
rewards before it learns the true value of the states. A
simulation using the policy learned in a typical experiment
is given in Fig. 3a. As can be seen, the pendulum swings
back once to build up momentum to eventually get to
the equilibrium. The desired Hamiltonian $H_d(x, \xi, \psi)$,
acquired through learning, is given in Fig. 3b. There
are three minima, of which one corresponds to the desired
equilibrium. The other two equilibria are undesirable
wells that come from the shaped potential energy
$P_d(q, \xi)$ (Fig. 4a). These minima are the result of the
algorithm trying to swing up the pendulum in a single
swing, which is not possible due to the saturation.
Hence, a swing-up strategy is necessary to avoid staying
in these wells. The number of these undesirable wells is
a function of the control saturation and of the number
of basis functions chosen to approximate $P_d(q, \xi)$. The
learned damping $K(x, \psi)$ (Fig. 4b) is positive (white)
towards the equilibrium, thus extracting energy from the
system, while it is negative (gray) in the region of the
initial state. The latter corresponds to pumping energy
into the system, which is necessary to build up momentum
for the swing-up and to escape the undesirable wells
of $P_d(q, \xi)$. A disadvantage is that control law (47), with
the suggested basis functions, is always zero on the set
$\Omega = \{x \mid x = (0 + j\pi, 0),\ j = 1, 2, \ldots\}$, which implies
that it is zero not only at the desired equilibrium, but
also at the initial state $x_0$. During learning this is not a
problem because there is constant exploration, but after
learning the system should not be initialized in exactly
$x_0$, otherwise it will stay in this set. This can be overcome
by initializing with a small perturbation $\epsilon$ around $x_0$. In
real-life systems it will also be less of a problem because
there is generally noise present on the sensors.

Fig. 3. Simulation results for the angle q (a, top), momentum
p (a, bottom) and the desired closed-loop Hamiltonian
$H_d(x, \xi, \psi)$ (b), including the simulated trajectory (black
dots) using the policy learned. Panels: (a) Simulated response;
(b) Desired Hamiltonian.



Fig. 4. Desired potential energy $P_d(q, \xi)$ (a) and sign of the
desired damping $\operatorname{sgn}(K(x, \psi))$ (b) (gray: negative; white:
positive) for a typical learning experiment. The black dots
indicate the value of the respective quantity for the
simulation of Fig. 3a.



6.3 Stability of the Learned Controller 

Since control saturation is present, the target dynamics
do not satisfy (5). Hence, to conclude local stability of $x^*$
based on (6), we calculate $\dot{H}_d(x, \psi)$ for the unsaturated
case (Fig. 5a)¹ and the saturated case ($\dot{H}_{d,\mathrm{sat}}(x, \psi)$), and
compute the sign of the difference (Fig. 5b). By looking
at Fig. 5b, it appears that $\exists\, \delta \in \mathbb{R} : |x - x^*| < \delta$ such
that $\dot{H}_{d,\mathrm{sat}}(x, \psi) = \dot{H}_d(x, \psi)$. It can be seen from Fig. 5b
that such a $\delta$ exists, i.e., a small gray region around the
equilibrium exists. Hence, we can use $\dot{H}_d(x, \psi)$ around
$x^*$ and assess stability using (6). From Fig. 3b it follows



¹ Fig. 5a is sign-opposite to Fig. 4b, which is logical, because
the negative (positive) regions of $K(x, \psi)$ correspond
to negative (positive) damping, which corresponds to a
positive (negative) value of $\dot{H}_d(x, \psi)$ based on (6).

Fig. 5. Signum of $\dot{H}_d(x, \psi)$ (a), indicating positive
(white) and negative (gray) regions, and (b)
$\dot{H}_{d,\mathrm{diff}}(x, \psi) = \operatorname{sgn}\left(\dot{H}_d(x, \psi) - \dot{H}_{d,\mathrm{sat}}(x, \psi)\right)$, indicating
regions where $\dot{H}_d(x, \psi) = \dot{H}_{d,\mathrm{sat}}(x, \psi)$ (gray) and
$\dot{H}_d(x, \psi) \neq \dot{H}_{d,\mathrm{sat}}(x, \psi)$ (white). Black dots indicate the
simulated trajectory.



Fig. 6. Results for the EBAC method for 20 learning experiments
with the real physical system, plotted against time [min]:
mean, 95% confidence region for the mean, and max and
min bounds.



that $H_d(x, \xi, \psi) > 0$ for all states in Fig. 3b. From Fig. 4a
we infer that locally,

$$\arg\min_q P_d(q, \xi) = q^*; \quad \nabla_q P_d(q^*, \xi) = 0; \quad \nabla_q^2 P_d(q^*, \xi) > 0 \qquad (59)$$

the latter two of which naturally result from the basis
function definition. Furthermore, from Fig. 4b it can be
seen that around $x^*$, $K(x, \psi) > 0$. Hence, in a region $\delta$
around $x^*$,

$$H_d(x, \xi, \psi) > 0; \quad \dot{H}_d(x, \psi) < 0; \quad \dot{H}_d(x^*, \psi) = 0 \qquad (60)$$

which implies local asymptotic stability of $x^*$. Extensive
simulations show that similar behaviour is always
achieved.
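Conditions of the kind listed in (60) can also be checked numerically on a grid around $x^*$. The sketch below does this for a hypothetical stand-in Hamiltonian with constant positive damping, not for the learned $H_d$ and $K$. It assumes a dissipation relation of the form $\dot{H}_d = -K (\partial H_d / \partial p)^2$, which is one common way damping injection appears for a single-input mechanical system; all names and numerical values are illustrative.

```python
import math

J_P = 1.0  # placeholder inertia (hypothetical value)


def h_d(q, p):
    """Stand-in desired Hamiltonian with a strict minimum at x* = (0, 0)."""
    return p ** 2 / (2.0 * J_P) + (1.0 - math.cos(q))


def dhd_dp(q, p):
    """Partial derivative of the stand-in Hamiltonian w.r.t. the momentum."""
    return p / J_P


def h_d_dot(q, p, k=0.5):
    """Assumed dissipation relation: H_d_dot = -K * (dH_d/dp)^2."""
    return -k * dhd_dp(q, p) ** 2


def locally_stable(delta=0.2, n=21):
    """Check H_d > H_d(x*) and H_d_dot <= 0 on a grid with |q|, |p| <= delta."""
    grid = [-delta + 2.0 * delta * i / (n - 1) for i in range(n)]
    for q in grid:
        for p in grid:
            if abs(q) < 1e-12 and abs(p) < 1e-12:
                continue  # skip the equilibrium itself
            if h_d(q, p) <= h_d(0.0, 0.0) or h_d_dot(q, p) > 0.0:
                return False
    return True


print(locally_stable())
```

In this stand-in the energy rate is only non-positive (it vanishes wherever `p = 0`), so a grid check of this kind supports stability, while asymptotic stability additionally needs an invariance-type argument. Passing a negative `k` makes `h_d_dot` positive, which mirrors the energy-pumping regions of the learned damping.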



6.4 Real-time Experiments 

Using the physical setup shown in Fig. 1, 20 learning
experiments were run using the same settings as in the
simulations. The result is given in Fig. 6. The algorithm
shows slightly slower convergence (about 3 minutes of
learning, or 60 trials, to reach a near-optimal policy
instead of 40 trials) and a less consistent average when
compared to Fig. 2. This can be attributed to a combination
of model mismatch and the symmetrical basis functions
(through which it is not possible to incorporate the
non-symmetrical friction that is present in the real system).
Overall though, the performance can be considered good
when compared to the simulation results. Also, the same
performance dip is present, which can again be attributed
to the optimistic value function initialization.



7 Conclusions 

In this paper, we have presented a method to systematically
parameterize EB-PBC control laws that is robust
to extra nonlinearities such as control input saturation.
The parameters are then found by making use of
actor-critic reinforcement learning. In this way, we are
able to learn a closed-loop energy landscape for PH systems.
The advantages are that near-optimal controllers can be
generated using energy-based control techniques, there
is no need to specify a global system Hamiltonian, and
the solutions acquired by means of reinforcement learning
can be interpreted in terms of energy shaping and
damping injection, which makes it possible to numerically
assess stability using passivity theory. By making
use of the model knowledge, the actor-critic method is
able to quickly learn near-optimal policies. A drawback
is that for multiple-input systems, generating many actor
updates for the desired damping matrix can be
computationally expensive. We have found that the proposed
Energy Balancing Actor Critic algorithm performs very
well in a physical mechanical setup. Due to the intrinsic
energy boundedness of the learned desired Hamiltonian,
we have observed that the system never becomes unstable
during learning. We are currently working on extending
the algorithms presented to an IDA-PBC setting,
such that more classes of systems can be addressed and
more freedom is given in shaping the desired Hamiltonian
(e.g. kinetic energy shaping).
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